Processing Homonyms in the Kana-to-Kanji Conversion

Authors

  • Masahito Takahashi
  • Tsuyoshi Shinchu
  • Kenji Yoshimura
  • Kosho Shudo
Abstract

This paper proposes two new methods to identify the correct meaning of Japanese homonyms in text based on the noun-verb co-occurrence in a sentence, which can be obtained easily from corpora. The first method uses the near co-occurrence data sets, which are constructed from the above co-occurrence relation, to select the most feasible word among homonyms in the scope of a sentence. The second uses the far co-occurrence data sets, which are constructed dynamically from the near co-occurrence data sets in the course of processing input sentences, to select the most feasible word among homonyms in the scope of a sequence of sentences. An experiment of kana-to-kanji (phonogram-to-ideograph) conversion has shown that the conversion is carried out at the accuracy rate of 79.6% per word by the first method. This accuracy rate of our method is 7.4% higher than that of the ordinary method based on the word occurrence frequency.

1 Introduction

Processing homonyms, i.e. identifying the correct meaning of homonyms in text, is one of the most important phases of kana-to-kanji conversion, currently the most popular method for inputting Japanese characters into a computer. Recently, several new methods for processing homonyms, based on neural networks (Kobayashi, 1992) or the co-occurrence relation of words (Yamamoto, 1992), have been proposed. These methods apply to the co-occurrence relation of words not only in a sentence but also in a sequence of sentences. It appears impracticable to prepare a neural network for co-occurrence data large enough to handle 50,000 to 100,000 Japanese words.

In this paper, we propose two new methods for processing Japanese homonyms based on the co-occurrence relation between a noun and a verb in a sentence. We have defined two co-occurrence data sets. One is a set of nouns accompanied by a case marking particle, each element of which has a set of co-occurring verbs in a sentence. The other is a set of verbs accompanied by a case marking particle, each element of which has a set of co-occurring nouns in a sentence. We call these two co-occurrence data sets near co-occurrence data sets. Thereafter, we apply the data sets to the processing of homonyms.

Two strategies are used to approach the problem. The first uses the near co-occurrence data sets to select the most feasible word among homonyms in the scope of a sentence. The aim is to evaluate the possible existence of a near co-occurrence relation, or co-occurrence relation between a noun and a verb within a sentence. The second evaluates the possible existence of a far co-occurrence relation, referring to a co-occurrence relation among words in different sentences. This is achieved by constructing far co-occurrence data sets from near co-occurrence data sets in the course of processing input sentences.

2 Co-occurrence data sets

The near co-occurrence data sets are defined. The first near co-occurrence data set is the set $\Sigma_{N_{\mathrm{near}}}$, each element of which ($n$) is a triplet consisting of a noun, a case marking particle, and a set of verbs which co-occur with that noun and particle pair in a sentence, as follows:

$n = (\mathit{noun}, \mathit{particle}, \{(v_1, k_1), (v_2, k_2), \ldots\})$

In this description, particle is a Japanese case marking particle, such as が (nominative case), を (accusative case), or に (dative case), $v_i$ ($i = 1, 2, \ldots$) is a verb, and $k_i$ ($i = 1, 2, \ldots$) is the frequency of occurrence of the combination noun, particle and $v_i$, which is determined in the course of constructing $\Sigma_{N_{\mathrm{near}}}$ from corpora. The following are examples of the elements of $\Sigma_{N_{\mathrm{near}}}$:

(rain, が (nominative case), {(fall, 10), (stop, 3), ...})
(rain, を (accusative case), {(take precautions, 3), ...})
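The noun-side data set described above can be pictured as a mapping from a (noun, particle) pair to verb frequencies. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names `build_near_cooccurrence` and `select_homonym`, the raw-frequency scoring rule, the romanized particles, and the "candy" entries (the nouns for rain and candy are both read "ame") are illustrative choices that do not come from the paper. Only the rain counts are taken from the examples above.

```python
from collections import defaultdict


def build_near_cooccurrence(triples):
    """Count observed (noun, particle, verb) triples.

    Returns a dict mapping (noun, particle) -> {verb: frequency},
    mirroring the triplet n = (noun, particle, {(v1, k1), (v2, k2), ...}).
    """
    data = defaultdict(lambda: defaultdict(int))
    for noun, particle, verb in triples:
        data[(noun, particle)][verb] += 1
    return data


def select_homonym(candidates, particle, verb, data):
    """Pick the candidate noun that co-occurs most often with the verb.

    Assumed scoring rule (not from the paper): raw co-occurrence
    frequency. If no candidate has been seen with the verb, the first
    candidate is returned unchanged.
    """
    best, best_freq = candidates[0], 0
    for noun in candidates:
        freq = data.get((noun, particle), {}).get(verb, 0)
        if freq > best_freq:
            best, best_freq = noun, freq
    return best


# Toy corpus: English glosses stand in for the Japanese words,
# "ga" for the nominative particle and "o" for the accusative particle.
triples = (
    [("rain", "ga", "fall")] * 10
    + [("rain", "ga", "stop")] * 3
    + [("candy", "o", "lick")] * 2  # invented entry for illustration
)
data = build_near_cooccurrence(triples)

# Two homonym candidates for the same kana string, followed by "ga" and "fall".
print(select_homonym(["candy", "rain"], "ga", "fall", data))  # prints: rain
```

A plain dictionary keeps the lookup for a (noun, particle) pair constant-time. The verb-side data set could be held in the same shape with the roles of noun and verb swapped.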


Similar resources

Large Scale Collocation Data and Their Application to Japanese Word Processor Technology

Word processors or computers used in Japan employ Japanese input method through keyboard stroke combined with Kana (phonetic) character to Kanji (ideographic, Chinese) character conversion technology. The key factor of Kana-to-Kanji conversion technology is how to raise the accuracy of the conversion through the homophone processing, since we have so many homophonic Kanjis. In this paper, we re...


An Automatic Translation System Of Non-Segmented Kana Sentences Into Kanji-Kana Sentences

This paper presents the algorithms to solve the two main problems comprised in the automatic Kana-Kanji translation system, in which the input sentences in Kana are translated into ordinary Japanese sentences in Kanji and Kana: the segmentation of non-segmented sentences into Bunsetsu and...


Discriminative Method for Japanese Kana-Kanji Input Method

The most popular type of input method in Japan is kana-kanji conversion, conversion from a string of kana to a mixed kanji-kana string. However, there is no study using discriminative methods like structured SVMs for kana-kanji conversion. One of the reasons is that learning a discriminative model from a large data set is often intractable. However, due to the progress of recent research, large sca...


Kana-Kanji Conversion System with Input Support Based on Prediction

TOSHIBA developed the world's first Japanese word processor in 1978. Unlike languages based on an alphabet, Japanese uses thousands of kanji characters of varying complexity. Hence, to arrange all of the kanji characters on a keyboard is difficult. On the other hand, kana characters, which are phonetic scripts of Japanese, have 83 variations; these can be arranged o...


Distinct role of spatial frequency in dissociative reading of ideograms and phonograms: An fMRI study

It has been proposed that distinct neural circuits are activated by reading Japanese ideograms (Kanji) and phonograms (Kana). By measuring high-density event-related potentials, we recently reported that spatial frequency (SF) information is responsible for the dissociation between Kanji and Kana reading. In particular, we found close links between Kana and low SF (LSF) information and between ...


Implicit and explicit processing of kanji and kana words and non-words studied with fMRI.

Using functional magnetic resonance imaging (fMRI), we investigated the implicit language processing of kanji and kana words (i.e., hiragana transcriptions of normally written kanji words) and non-words. Twelve right-handed native Japanese speakers performed size judgments for character stimuli (implicit language task for linguistic stimuli), size judgments for scrambled-character stimuli (impl...




Publication date: 1996